A Best-Match Algorithm for Broad-Coverage Example-Based Disambiguation
To improve the coverage of example-bases, two methods are introduced into the best-match algorithm. The first is for acquiring conjunctive relationships from corpora, as measures of word similarity that can be used in addition to thesauruses. The second, used when a word does not appear in an example-base or a thesaurus, is for inferring links to words in the example-base by comparing the usage of the word in the text and that of words in the example-base.

1 Introduction

Improvement of coverage in practical domains is one of the most important issues in the area of example-based systems. The example-based approach [6] has become a common technique for natural language processing applications such as machine translation and disambiguation (e.g. [5, 10]). However, few existing systems can cover a practical domain or handle a broad range of phenomena.

The most serious obstacle to robust example-based systems is the coverage of example-bases. It is an open question how many examples are required for disambiguating sentences in a specific domain. The Sentence Analyzer (SENA) was developed in order to resolve attachment, word-sense, and conjunctive ambiguities by using constraints and example-based preferences [11]. It lists about 57,000 disambiguated head-modifier relationships and about 300,000 synonyms and is-a binary relationships. Even so, lack of examples (no relevant examples) accounted for 46.1% of failures in an experiment with SENA [12].

Previously, it was believed to be easier to collect examples than to develop rules for resolving ambiguities. However, the coverage of each example is much more local than that of a rule, and therefore a huge number of examples is required in order to resolve realistic problems.
There has been some corpus-based research on how to acquire large-scale knowledge automatically in order to cover the domain to be disambiguated, but there are still major problems to be overcome. First, semantic knowledge such as word-sense cannot be extracted by automatic corpus-based knowledge acquisition. The example-base in SENA is developed by using a bootstrapping method. However, the results of word-sense disambiguation must be checked by a human, and only about half of all the examples are tagged with word-senses, since the task is very time-consuming.

A second difficulty in the example-based approach is the algorithm itself, namely, the best-match algorithm, which was used in earlier systems built around a thesaurus that consisted of a hierarchy of is-a or synonym relationships between words (word-senses).

This paper proposes two methods for improving the coverage of example-bases. The selected domain is that of sentences in computer manuals. First, knowledge that represents a type of similarity other than synonym or is-a relationships is acquired. As one measurement of the similarity, interchangeability between words can be used. In this paper, two types of the relationship reflect such interchangeability. First, the elements of coordinated structures are good clues to the interchangeability of words. Words can be extracted easily from a domain-specific corpus, and therefore the example-base can be adapted to the specific domain by using the domain-specific relationships.

If there are no examples and relations in the thesaurus, the example-base gives no information for disambiguation. However, the text to be disambiguated provides useful knowledge for this purpose [7, 3]. The relationships between words in the example-base and an unknown word can be guessed by comparing that word's usage in extracted examples and in the text.
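
As a rough illustration of the idea that the elements of coordinated structures hint at interchangeability, the sketch below collects words that appear on either side of a coordinating conjunction in a small corpus. It is not the procedure used in SENA: the function name, the conjunction list, and the adjacent-token heuristic (which ignores parsing and multi-word conjuncts) are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed list of coordinating conjunctions; a real system would rely on a parser.
CONJUNCTIONS = {"and", "or"}


def extract_conjunctive_pairs(sentences):
    """Map each word to the set of words it was coordinated with.

    Hypothetical helper: only the single tokens immediately to the left
    and right of a conjunction are paired.
    """
    related = defaultdict(set)
    for sentence in sentences:
        # Lowercase, split on whitespace, and strip simple punctuation.
        tokens = [t.strip(".,;:") for t in sentence.lower().split()]
        tokens = [t for t in tokens if t]
        for i in range(1, len(tokens) - 1):
            if tokens[i] in CONJUNCTIONS:
                left, right = tokens[i - 1], tokens[i + 1]
                if left != right:
                    related[left].add(right)
                    related[right].add(left)
    return dict(related)


if __name__ == "__main__":
    corpus = [
        "Store the file on a diskette or tape.",
        "Insert the diskette and cartridge",
    ]
    for word, partners in sorted(extract_conjunctive_pairs(corpus).items()):
        print(word, sorted(partners))
```

Pairs gathered this way from a domain-specific corpus could serve as an additional similarity relation alongside thesaurus links, though the adjacent-token heuristic will also pick up spurious pairs that a syntactic analysis would reject.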
2 A Best-Match Algorithm

In this section, conventional algorithms for example-based disambiguation, and their associated problems, are briefly introduced. The algorithms of most example-based systems consist of the following three steps (in some systems, the exact-match and the best-match are merged):
